ABSTRACT

Understanding social network structure and evolution has impor- 
tant implications for many aspects of network and system design 
including provisioning, bootstrapping trust and reputation systems 
via social networks, and defenses against Sybil attacks. Several re- 
cent results suggest that augmenting the social network structure 
with user attributes (e.g., location, employer, communities of inter- 
est) can provide a more fine-grained understanding of social net- 
works. However, there have been few studies to provide a system- 
atic understanding of these effects at scale. 

We bridge this gap using a unique dataset collected as the Google+ 
social network grew over time since its release in late June 2011. 
We observe novel phenomena with respect to both standard social 
network metrics and new attribute-related metrics (that we define). 
We also observe interesting evolutionary patterns as Google+ went 
from a bootstrap phase to a steady invitation-only stage before a 
public release. 

Based on our empirical observations, we develop a new gener- 
ative model to jointly reproduce the social structure and the node 
attributes. Using theoretical analysis and empirical evaluations, we 
show that our model can accurately reproduce the social and at- 
tribute structure of real social networks. We also demonstrate that 
our model provides more accurate predictions for practical appli- 
cation contexts. 

Categories and Subject Descriptors 

J.4 [Computer Applications] : Social and behavioral sciences 

Keywords 

Social network measurement, Node attributes, Social network evo- 
lution, Heterogeneous network measurement and modeling, Google+ 
1. INTRODUCTION

Online social networks (e.g., Facebook, Google+, Twitter) have 
become increasingly important platforms for interacting with peo- 
ple, processing information and diffusing social influence. Thus 
understanding social-network structure and evolution has impor- 
tant implications for many aspects of network and system design 
including bootstrapping reputation via social networks (e.g., [39]),
defenses against Sybil attacks (e.g., [14]), leveraging social
networks for search [1], and recommender systems with social
regularization [35].

Traditional social network studies have largely focused on un- 
derstanding the topological structure of the social network, where 
each user can be viewed as a node and a specific relationship (e.g., 
friendship, co-authorship) is represented by a link between two 
nodes. More recently, there has been growing interest in augmenting
this social network with user attributes; we call the result a
Social-Attribute Network (SAN). User attributes could be static (e.g., school,
major, employer and city derived from user profiles), or dynamic 
(e.g., online interest and community groups). Recent studies have 
demonstrated the promise of social-attribute networks in applications
such as link prediction [58, 17], attribute inference [17, 58],
and community detection [62].

Despite the growing importance of such social-attribute networks 
in social network analysis applications, there have been few efforts 
at systematically measuring and modeling the evolution of social- 
attribute networks. Most prior work in the measurement and modeling
space focuses primarily on the social structure [3, 4, 13, 26,
28, 33, 38]. Measuring social-attribute networks can simultaneously
inform us about the properties of the social network structure, the
attribute structure, and how such attributes impact social network
structure.

In this paper, we present a detailed study of the evolution of 
social-attribute networks using a unique large-scale dataset collected
by crawling the Google+ social network structure and its user
profiles. This dataset offers a unique opportunity for us as we were 
fortunate to observe the complete evolution of the social network 
and its growth to around 30 million users within a span of three 
months. 

We observe novel patterns in the growth of the Google+ social- 
attribute network. First, we observe that the social reciprocity of 
Google+ is lower than many traditional social networks and is closer 
to that of Twitter. Second, in contrast to many prior networks, 
the social degree distributions in Google+ are best modeled by 
a lognormal distribution. Third, we observe that the assortativity of
the Google+ social network is neutral, while many other social
networks have positive assortativity. Fourth, we also see that the
distinct phases (initial launch, invite only, public release) in the 
timeline of Google+ naturally manifest themselves in the social and 
attribute structures. Fifth, for the generalized attribute metrics (that 
we define), while some attribute metrics mirror their social counter- 
parts (e.g., diameter), several show distributions and trends that are 
significantly different (e.g., clustering coefficient, attribute degree). 
Finally, via the social-attribute network framework, we study the 
impact of user attributes on the social structure and observe that 
nodes sharing common attributes are likely to have higher social 
reciprocity and that some attributes have much stronger influence 
than others (e.g., Employer vs. City). 

Based on our observations, we develop a new generative model 
for SANs. Our model includes two new components, attribute-augmented
preferential attachment and attribute-augmented triangle-closing,
which extend the classical preferential attachment [5, 27]
and triangle-closing [29, 43, 53, 21], respectively. Using both
theoretical analysis and empirical evaluation, we show that our model
can reproduce SANs that accurately reflect the true ones with respect
to various network metrics and real-world applications. Such
a generative model has many applications [30], such as network
extrapolation and sampling, network visualization and compression,
and network anonymization [44].

To summarize, the key contributions of this work are: 

• We perform the first study of the evolution of social-attribute 
networks using Google+. We observe novel phenomena in stan- 
dard social structure metrics and new attribute-related metrics 
(that we define) and how attributes impact the social structure. 

• We develop a measurement-driven generative model for the
social-attribute network that incorporates the impact of user attributes
on network evolution.

• Using both theoretical analysis and empirical evaluation, we 
validate that our model can accurately reproduce real social- 
attribute networks. 



2. PRELIMINARIES AND DATASET 

In this section, we begin with some background on augmenting 
social network structure with attributes. Then, we describe how 
we collected the Google+ data and how we augment the Google+ 
social network with user attribute information. We also present 
some basic measurements describing the evolution of Google+.

2.1 Social-Attribute Network (SAN) 

In this section, we review the definition of a Social-Attribute
Network (SAN) [17] and introduce the basic notation used in the rest
of this paper.

Given a directed social network G, in which nodes are users and 
edges represent friend relationships between users, and M distinct 
binary attributes, which could be static (e.g., name of employer, 
name of school, major, etc.) or dynamic (e.g., interest groups), 
a SAN is an augmented network with M additional nodes where 
each such node corresponds to a specific binary attribute. For each 
node u in G with attribute a, we create an undirected link between 
u and a in the SAN. 

Nodes in a SAN corresponding to nodes in G are called social
nodes and denoted as the set $V_s$, while nodes representing attributes
are called attribute nodes and denoted as the set $V_a$. Figure 1
shows an example SAN. Links between social nodes are called
social links and denoted as the set $E_s$, while links between social
nodes and attribute nodes are called attribute links and denoted
as the set $E_a$. Thus a Social-Attribute Network is denoted
as $SAN = (V_s, V_a, E_s, E_a)$.

Figure 1: Illustration of a SAN with six social nodes and four attribute
nodes. Note that the social links between users are directed
whereas the attribute-user links are undirected.

For a given social or attribute node $u$ in a SAN, we denote its
attribute neighbors as
$\Gamma_a(u) = \{v \mid v \in V_a, (u,v) \in E_a\}$, its social
neighbors as
$\Gamma_s(u) = \{v \mid v \in V_s, (v,u) \in E_s \cup E_a \text{ or } (u,v) \in E_s \cup E_a\}$,
its social in-neighbors as $\Gamma_{s,in}(u) = \{v \mid (v,u) \in E_s\}$, and
its social out-neighbors as $\Gamma_{s,out}(u) = \{v \mid (u,v) \in E_s\}$. Note that
an attribute node can only have social neighbors.
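
The notation above maps onto a small data structure. The following is a minimal Python sketch for illustration only, not the code used in this study; the class name `SAN`, its method names, and the toy users and attribute are all assumptions, and it presumes that social and attribute node identifiers never collide.

```python
from collections import defaultdict


class SAN:
    """Toy Social-Attribute Network: directed social links, undirected attribute links."""

    def __init__(self):
        self.out_nbrs = defaultdict(set)   # u -> {v : (u, v) in E_s}
        self.in_nbrs = defaultdict(set)    # u -> {v : (v, u) in E_s}
        self.attrs_of = defaultdict(set)   # social node -> its attribute nodes
        self.users_of = defaultdict(set)   # attribute node -> social nodes having it

    def add_social_link(self, u, v):
        """Add the directed social link (u, v)."""
        self.out_nbrs[u].add(v)
        self.in_nbrs[v].add(u)

    def add_attribute_link(self, u, a):
        """Add the undirected attribute link between user u and attribute a."""
        self.attrs_of[u].add(a)
        self.users_of[a].add(u)

    def social_in(self, u):
        return set(self.in_nbrs.get(u, ()))

    def social_out(self, u):
        return set(self.out_nbrs.get(u, ()))

    def attribute_neighbors(self, u):
        return set(self.attrs_of.get(u, ()))

    def social_neighbors(self, x):
        """In/out social neighbors of a user, or the users of an attribute node."""
        return self.social_in(x) | self.social_out(x) | set(self.users_of.get(x, ()))


# Hypothetical example data (not taken from the Google+ crawl).
san = SAN()
san.add_social_link("alice", "bob")
san.add_social_link("bob", "alice")
san.add_social_link("carol", "alice")
employer = ("Employer", "ExampleCorp")
san.add_attribute_link("alice", employer)
san.add_attribute_link("carol", employer)
```

Keeping separate in- and out-neighbor maps mirrors the observation in the next section that Google+ exposes both lists, which is what makes both directions of a link visible.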

2.2 Google+ Data 

Google+ was launched with an invitation-only test phase on June
28, 2011, and opened to everyone 18 years of age or older on 
September 20, 2011. We believed this was a tremendous oppor- 
tunity to observe the real-world evolution of a large-scale social- 
attribute network. Thus, we began to crawl daily snapshots of pub- 
lic Google+ social network structure and user profiles; our crawls 
lasted from July 6 to October 11, 2011. The first snapshot was 
crawled by breadth-first search (without early stopping). On sub- 
sequent days, we expanded the social structure from the previous 
snapshot. For most snapshots, our crawl finished within one day as 
Google did not limit the crawl rate during that time. 

We believe our crawl collected a large Weakly Connected Com- 
ponent (WCC) of Google+. This may be surprising as many past 
attempts on Flickr, Facebook, YouTube etc., were unable to do 
so [38]. The key difference is that these were only able to access
outgoing links. In contrast, each user in Google+ has both an out- 
going list (i.e., "in your circles") and an incoming list (i.e., "have 
you in circles"). This allows us to access both outgoing and incom- 
ing links making it feasible to crawl the entire WCC. 

We have two points of reference that suggest our coverage is high
(> 70%): (1) TechCrunch estimated that the number of Google+ users
on July 12, 2011 was around 10 million [52]; our crawled snapshot
on the same day has 7 million users. (2) Google announced that 40
million users had joined Google+ in mid-October [19]; our crawled
snapshot on October 11 has around 30 million users.

We take each user u in Google+ as a social node in SAN, and 
connect it to her outgoing friends via outgoing links and incom- 
ing friends via incoming links. We use four attribute types School, 
Major, Employer and City that were available and easy to extract. 
Specifically, we find all distinct schools, majors, employers and 
cities that appear in at least one user profile and use them as at- 
tribute nodes. Recall that a social node u is connected to attribute 
node a via an undirected link if u has attribute a. In this way, we 
construct a SAN from each crawled snapshot, resulting in 79 SANs 
during the period from July 6 to October 11, 2011. 

Figures 2 and 3 show the temporal evolution of the number of
nodes and links in the Google+ SAN. From the results we clearly
see three distinct phases in the evolution of Google+: Phase I from
day 1 to day 20, which corresponds to the early days of Google+,
when its size increased dramatically; Phase II from day 21 to day 75,
during which Google+ grew at a stable rate; and
Phase III from day 76 to day 98, when Google+ opened to the public
(i.e., without requiring an invitation), resulting in dramatic growth
again. We point this out because we observe a similar three-phase
evolution pattern for almost all network metrics that we analyze in
the subsequent sections.

Figure 2: Growth in the number of social and attribute nodes in the
Google+ dataset.

Figure 3: Growth in the number of social and attribute links in the
Google+ dataset. (a) Social links (b) Attribute links

In the following sections, we use the last or largest snapshot, 
unless we are interested in the time-varying behavior. 

Potential biases: We would like to acknowledge two possible bi- 
ases. First, users may keep some of their friends or circles private. 
In this case, we can only see the publicly visible list. Thus we 
may not crawl the entire WCC and underestimate the node degrees. 
However, as discussed earlier, we obtain a very large connected 
component that covers more than 70% of known users which is 
sufficiently representative. Second, users may choose not to de- 
clare their attributes, in which case we may underestimate the im- 
pact of attributes on the social structure. However, we find that 
roughly 22% of users declare at least one attribute which repre- 
sents a statistically large sample from which to draw conclusions. 
Furthermore, by validating the attribute-related results on further
subsamples of the attributes we have, we show that our attributes are
representative of the full attribute set.

3. SOCIAL STRUCTURE OF THE 
GOOGLE+ SAN 

In this section, we begin by presenting several canonical network 
metrics commonly used for characterizing social networks such as 
the reciprocity, density, clustering coefficient, and degree
distribution [38, 25, 28, 41]. These metrics are useful to expose the
inherent structure of a social network in terms of the friend relationships
and whether there are "community" structures beyond a one-hop
friend relationship. It is particularly useful to revisit these metrics
in the context of Google+ both because of its scale and because it 
enables a somewhat hybrid relationship model compared to other 
networks such as Facebook, Twitter, Flickr, and email networks. 
Furthermore, since we have a unique opportunity to observe the 
network as it grew, we also analyze how these properties changed 
as the Google+ SAN evolved. 

3.1 Reciprocity 

The reciprocity metric for directed social networks represents the 
fraction of social links that are mutual; i.e., if there is an A → B
edge, what is the likelihood of the reverse B → A edge? Previous
work studied the global reciprocity of specific snapshots
of social networks and measured it to be 0.62 on Flickr, 0.79 on
YouTube [38], and 0.22 on Twitter [28]. We focus on the evolution
of global reciprocity for Google+ in Figure 4a. The result
shows an interesting behavior where the reciprocity fluctuates in 
Phase I, decreases in Phase II and decreases even faster in Phase 
III. We speculate that this arises because of the hybrid nature of 
Google+. Initially many people treat the network like a traditional 
social network (e.g., Facebook) where the relationships are mutual. 
However, as time progresses and people appear to become famil- 
iar with the Twitter-like publisher-subscriber model also offered by 
Google+, the reciprocity decreases. 
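
As a concrete reading of the definition, the following minimal Python sketch computes global reciprocity on a toy edge list. The function name `global_reciprocity` and the example links are assumptions for illustration; the sketch says nothing about how the measurement was implemented at Google+ scale.

```python
def global_reciprocity(edges):
    """Fraction of directed links (u, v) whose reverse link (v, u) also exists."""
    edge_set = {(u, v) for u, v in edges if u != v}  # drop self-loops and duplicates
    if not edge_set:
        return 0.0
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


# Hypothetical links: a <-> b is mutual, a -> c and d -> a are one-directional.
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")]
reciprocity = global_reciprocity(edges)  # 2 of the 4 links are reciprocated
```

Note that a mutual pair contributes two links to both the numerator and the denominator, so the value is a fraction of links rather than a fraction of node pairs.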

3.2 Density 

The ratio of links to nodes, $|E_s|/|V_s|$, captures the density¹ of a
social network. To put this in context, previous studies show that the
social density increases over time on citation and affiliation
networks [33] and on Facebook [4], fluctuates in an
increase-decrease-increase fashion on Flickr [26], and is relatively
constant on email communication networks [25].

Figure 4b shows the evolution of this social density metric in
Google+. We observe that social density in the Google+ network has
a sharp decrease followed by an increase in Phase I, a continued
increase in Phase II, and a sudden drop in Phase III (when Google+
opened to the public) followed by a steady increase again. This
three-phase pattern can be explained in conjunction with the trends
in Figures 2a and 3a. In the early part of Phase I, even though the
rate of users joining Google+ is high, the rate of adding links is
low, possibly because many of a user's existing friends have not
yet joined. This causes social density to decrease. In the later part
of Phase I, users acquire friends at a rate higher than the rate at
which new users join, and the same trend continues in Phase II, so the
social density increases. In Phase III, the number of users in Google+
jumps suddenly due to the public release, but the number of friendship
links increases less dramatically. This once again causes the social
density to drop significantly around t=70, after which it slowly
increases again. Our findings have implications for network modeling.
Specifically, many network models assume either constant
density [5, 24] or power-law densification [35], neither of which is
consistent with Google+.
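
To make the contrast with densification models concrete, the sketch below computes the links-to-nodes ratio per snapshot and a local densification exponent between consecutive snapshots, i.e., the value $a$ for which links grow as nodes$^a$ over that interval. This is an illustrative Python sketch under stated assumptions: the function names are ours and the snapshot counts are made up, not values from the Google+ dataset.

```python
import math


def social_density(num_links, num_nodes):
    """Links-to-nodes ratio |E_s| / |V_s|, the 'density' used in the text."""
    return num_links / num_nodes


def local_densification_exponents(snapshots):
    """For consecutive (nodes, links) snapshots, return a such that links ~ nodes**a.

    Assumes the node count strictly grows between snapshots.
    """
    exps = []
    for (n0, e0), (n1, e1) in zip(snapshots, snapshots[1:]):
        exps.append(math.log(e1 / e0) / math.log(n1 / n0))
    return exps


# Hypothetical (nodes, links) counts, NOT measurements from the crawl.
snapshots = [(1_000, 5_000), (2_000, 8_000), (4_000, 24_000), (16_000, 48_000)]
densities = [social_density(e, n) for n, e in snapshots]
exponents = local_densification_exponents(snapshots)
```

A constant-density model corresponds to an exponent of 1 in every interval, and power-law densification to a fixed exponent above 1; a density that falls, rises, and falls again yields exponents that cross 1 in both directions, which neither assumption allows.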

3.3 Diameter 

In directed social networks, the distance between two user nodes 
u and v, dist(u, v) is defined as the length of the shortest directed 
path whose head is v and tail is u. Note that only social links $E_s$ are
used in this definition. We find that the distribution of the distance 
between nodes has a dominant mode at a distance of six, with most 
nodes (90%) having a distance of 5, 6, or 7 (not shown). 

¹In graph theory, density is defined as the fraction of existing links
with respect to all possible links. We follow the terminology in [26]
in order to compare with previous results.
Figure 4: Evolution of four key metrics: reciprocity, density, diameter and clustering coefficient on the Google+ SAN. In each case, we
observe distinct behaviors in the three phases corresponding to early initialization, time to public release, and time after public release.

Based on the distance distribution, we can also define the effective
diameter as the 90th-percentile distance (possibly with some
interpolation) between every pair of connected nodes [33]. Unfortunately,
computing the effective diameter exactly is infeasible for large
networks, so we use the HyperANF approximation algorithm [8],
which has been shown to approximate the diameter with high
accuracy.
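
For intuition about what is being approximated, the following Python sketch computes the distance distribution and effective diameter exactly, using one breadth-first search per node. This is deliberately not HyperANF: it is only practical on small graphs, and it omits the optional interpolation. The function names and the toy path graph are assumptions for illustration.

```python
from collections import Counter, deque


def distance_histogram(out_nbrs):
    """Count ordered pairs (u, v), u != v, with v reachable from u, by directed distance."""
    nodes = set(out_nbrs) | {v for vs in out_nbrs.values() for v in vs}
    hist = Counter()
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in out_nbrs.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    hist[dist[v]] += 1
                    queue.append(v)
    return hist


def effective_diameter(hist, q=0.9):
    """Smallest d such that at least a fraction q of connected pairs are within distance d."""
    total = sum(hist.values())
    covered = 0
    for d in sorted(hist):
        covered += hist[d]
        if covered >= q * total:
            return d
    return 0


# Toy directed path a -> b -> c -> d: three pairs at distance 1, two at 2, one at 3.
path = {"a": ["b"], "b": ["c"], "c": ["d"]}
hist = distance_histogram(path)
diameter_90 = effective_diameter(hist)
```

The exact computation costs one traversal per node, which is what makes it infeasible for a network with tens of millions of users.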

Previous work observed that the effective diameter shrinks in citation
networks, autonomous networks and affiliation networks [33], in
Flickr and Yahoo! 360 [26], and in Cyworld [3]. However, we observe
that the effective diameter follows a three-phase evolution, as
seen in Figure 4c, which again can be explained in conjunction with
the trends in Figures 2a and 3a. In Phase I, the user joining rate
outpaces the link creation rate, causing the diameter to increase; in
Phase II, the user joining rate is lower than the link acquisition rate,
resulting in a decreasing diameter; and in Phase III the user joining
rate is much higher, so the diameter increases again. Again, our
observations have implications for network modeling. Existing network
models assume either a logarithmically growing diameter [55, 5]
or a shrinking diameter [30, 33].

3.4 Clustering Coefficient 

Given a network G and a node u, its clustering coefficient is defined as

$$c(u) = \frac{L(u)}{|\Gamma_s(u)|\,(|\Gamma_s(u)| - 1)},$$

where $L(u)$ is the number of links among its social neighbors
$\Gamma_s(u)$, and the average social clustering coefficient is defined as
$C_s = \frac{1}{|V_s|} \sum_{u \in V_s} c(u)$ [55]. Intuitively, this captures the
community structure among a user's friends.

Figure 5: Indegree and outdegree distributions for the social nodes
in the Google+ SAN along with their best-fit curves. We observe
that both are best modeled by a discrete lognormal distribution, unlike
many networks that suggest power-law distributions.

Again, computing the average clustering coefficient is expensive.
Thus, we extend the constant-time approximation algorithm proposed
by Schank et al. for undirected networks [45], and develop an
algorithm to approximate the clustering coefficients for a directed
network. With $\lceil \ln(2\nu) / (2\epsilon^2) \rceil$ random samples, our
constant-time algorithm bounds the error of the average clustering
coefficient within $\epsilon$ with probability at least $1 - \frac{1}{\nu}$.
In practice, we set the error to be $\epsilon = 0.002$ and $\nu = 100$.
Algorithm details and theoretical analysis can be found in Appendix A.
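
The sampling idea can be sketched as follows. This is a hedged illustration in the spirit of Schank et al. [45], not the algorithm of Appendix A: it assumes that $c(u)$ is normalized by the number of ordered neighbor pairs, and the function names, the default parameters, and the toy graph are ours. Each sample picks a random node and a random ordered pair of its neighbors and checks whether the pair is linked, so the sample mean estimates the average clustering coefficient.

```python
import math
import random


def samples_needed(eps, nu):
    """Hoeffding-style sample count: error <= eps with probability >= 1 - 1/nu."""
    return math.ceil(math.log(2 * nu) / (2 * eps ** 2))


def approx_avg_clustering(nbrs, links, eps=0.05, nu=100, seed=0):
    """Estimate the average clustering coefficient by sampling.

    nbrs:  node -> list of its social neighbors (lists, so they can be sampled)
    links: set of directed social links (v, w)
    """
    rng = random.Random(seed)
    nodes = list(nbrs)
    k = samples_needed(eps, nu)
    hits = 0
    for _ in range(k):
        u = rng.choice(nodes)
        if len(nbrs[u]) < 2:
            continue                     # c(u) = 0 when u has fewer than two neighbors
        v, w = rng.sample(nbrs[u], 2)    # random ordered pair of distinct neighbors
        hits += (v, w) in links
    return hits / k


# Toy complete directed graph on four nodes: every neighbor pair is linked.
k4 = {i: [j for j in range(4) if j != i] for i in range(4)}
full = {(i, j) for i in range(4) for j in range(4) if i != j}
estimate = approx_avg_clustering(k4, full)
```

The number of samples depends only on $\epsilon$ and $\nu$, not on the size of the network, which is why the running time is constant in the number of nodes.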

Kossinets et al. [25] observed a constant average social clustering
coefficient over time in an email communication network. However,
we find that the evolution of the average social clustering coefficient
of Google+, shown in Figure 4d, again follows a
three-phase pattern: the clustering coefficient decreases
dramatically in Phase I, increases slowly in Phase II, and
decreases again in Phase III. Our findings indicate that the community
structure among users' friends is highly dynamic, which motivates
dynamic community detection.
Figure 6: Evolution of the lognormal parameters for the indegree
and outdegree distributions. 

3.5 Degree Distributions 

Next, we focus on the social indegree and outdegree of users in
Google+. In each case, we are also interested in identifying an
empirical best-fit distribution using the tool [54, 10], which compares
fits of several widely used distributions (e.g., power-law, lognormal,
power-law with cutoff) with respect to goodness-of-fit.
We find that unlike many studies on social networks, in which social
degrees usually follow a power-law distribution [13, 38], social
degrees in Google+ are best captured by a discrete lognormal
distribution. Recall that a random variable $x \in \mathbb{Z}^+$ follows a
power-law distribution if $p(x = k) \propto k^{-\alpha}$, where $\alpha$ is the
exponent of the power-law distribution. On the other hand, a random
variable $x \in \mathbb{Z}^+$ follows a discrete lognormal distribution if

$$p(x = k) \propto \frac{1}{k} \exp\left(-\frac{(\ln k - \mu)^2}{2\sigma^2}\right)$$

[7], where $\mu$ and $\sigma$ are the mean
and standard deviation, respectively, of the lognormal distribution.
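
The two candidate distributions can be compared on a degree sample by their log-likelihoods. The Python sketch below is a simplified stand-in for the fitting tool cited above, not a reimplementation of it: it normalizes each distribution over a truncated support $1..k_{max}$ (an assumption), uses fixed rather than fitted parameters, and evaluates a small synthetic degree sample that is not drawn from the Google+ data.

```python
import math


def power_law_pmf(alpha, kmax):
    """p(x = k) proportional to k**(-alpha), normalized over k = 1..kmax."""
    w = [k ** -alpha for k in range(1, kmax + 1)]
    z = sum(w)
    return [x / z for x in w]


def discrete_lognormal_pmf(mu, sigma, kmax):
    """p(x = k) proportional to (1/k) exp(-(ln k - mu)^2 / (2 sigma^2)), k = 1..kmax."""
    w = [math.exp(-(math.log(k) - mu) ** 2 / (2 * sigma ** 2)) / k
         for k in range(1, kmax + 1)]
    z = sum(w)
    return [x / z for x in w]


def log_likelihood(pmf, degrees):
    """Sum of log-probabilities of the observed degrees (pmf[k - 1] is p(x = k))."""
    return sum(math.log(pmf[k - 1]) for k in degrees)


# Synthetic degrees concentrated around 7-8, purely for illustration.
degrees = [5, 6, 7, 7, 8, 8, 8, 9, 10, 12]
ll_power_law = log_likelihood(power_law_pmf(2.0, 1000), degrees)
ll_lognormal = log_likelihood(discrete_lognormal_pmf(2.0, 0.3, 1000), degrees)
```

On this sample the lognormal scores higher because a power law puts most of its mass on the smallest degrees; a real comparison would fit the parameters by maximum likelihood and apply a goodness-of-fit test as in [10].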

Figure 5 shows these degree distributions and their discrete lognormal
fits, and Figure 6 shows the evolution of the parameters of
the fitted discrete lognormal distributions. We see that the evolution of
the outdegree and indegree distributions follows a similar trend, but
with fluctuations differing in magnitude (Figures 6a and 6b).

Lognormally distributed degrees imply that there are
probabilistically more low-degree social nodes in Google+ than
in networks with power-law degree distributions.
Figure 7: Two metrics for capturing the joint-degree distribution:
(a) the $k_{nn}$ metric, shown as a log-log plot of the outdegree versus
the average indegree of friends, and (b) the evolution of the
assortativity coefficient.

3.6 Joint Degree Distribution 

Last, we examine the joint degree distribution (JDD) of the Google+
social structure. The JDD is useful for understanding the preference of
a node to attach itself to nodes that are similar to itself. One way to
approximate the JDD is using the degree correlation function $k_{nn}$,
which maps outdegree to the average indegree of all nodes connected
to nodes of that outdegree [42, 38]. An increasing $k_{nn}$ trend
indicates that high-degree nodes tend to connect to other high-degree
nodes; a decreasing $k_{nn}$ represents the opposite trend. Figure 7a
shows the $k_{nn}$ function for the Google+ social structure.

The JDD can further be quantified using the assortativity coefficient
$r$, which can range from -1 to 1 [41]; $r$ is positive if $k_{nn}$ is
positively correlated with node degree $k$. Figure 7b illustrates the
evolution of the assortativity coefficient. We observe that $r$ keeps
decreasing in all three phases but at different rates. Furthermore, unlike
many traditional social networks, where the assortativity coefficient
is typically positive (0.202 for Flickr, 0.179 for LiveJournal,
and 0.072 for Orkut [38, 41]), Google+ has almost neutral assortativity,
close to 0. The neutral assortativity can possibly be explained
by the hypothesis that Google+ is a hybrid of two ingredients: a
traditional social network and a publisher-subscriber network (e.g.,
Twitter). Traditional social networks usually have positive assortativity;
publisher-subscriber networks often have negative assortativity
because high-degree publisher nodes tend to be connected to
low-degree subscriber nodes. Thus a hybrid of the two results in a
network with neutral assortativity. The evolution of the Google+
assortativity coefficient (positive in Phase I, around zero in Phase
II, and negative in Phase III) reflects the competition between
the two ingredients of Google+. More specifically, the traditional
network ingredient slightly wins in Phase I, resulting in a slightly
positive assortativity coefficient. A draw between them in Phase
II results in neutral assortativity. In Phase III, the publisher-subscriber
ingredient wins, resulting in a slightly negative assortativity
coefficient. This implies that Google+ is becoming more and more like
a publisher-subscriber network.
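
Both quantities can be computed directly from an edge list. The Python sketch below is illustrative only: it takes assortativity to be the Pearson correlation, over links, between the source's outdegree and the target's indegree, which is one of several directed variants and may differ from the exact definition in [41]. The function names and the toy publisher-subscriber graph are assumptions.

```python
import math
from collections import defaultdict


def degrees(edges):
    outdeg, indeg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return outdeg, indeg


def knn_by_outdegree(edges):
    """Map outdegree k -> average indegree of nodes linked to by nodes of outdegree k."""
    outdeg, indeg = degrees(edges)
    total, count = defaultdict(float), defaultdict(int)
    for u, v in edges:
        total[outdeg[u]] += indeg[v]
        count[outdeg[u]] += 1
    return {k: total[k] / count[k] for k in total}


def assortativity(edges):
    """Pearson correlation over links (u, v) of outdegree(u) and indegree(v)."""
    outdeg, indeg = degrees(edges)
    xs = [outdeg[u] for u, _ in edges]
    ys = [indeg[v] for _, v in edges]
    n = len(edges)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0                       # correlation undefined; treat as neutral
    return cov / math.sqrt(var_x * var_y)


# Toy publisher-subscriber pattern: a, b, c follow hub h; h follows x, y, z.
edges = [("a", "h"), ("b", "h"), ("c", "h"), ("h", "x"), ("h", "y"), ("h", "z")]
knn = knn_by_outdegree(edges)
r = assortativity(edges)
```

In this toy graph low-outdegree nodes point at the high-indegree hub and the hub points at low-indegree nodes, so $k_{nn}$ decreases with outdegree and $r$ is strongly negative, the pattern the text attributes to publisher-subscriber networks.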

3.7 Summary of Key Observations and Implications

Analyzing the social structure of Google+ and its evolution over 
time, we find that: 



• In contrast to many traditional networks, we find that Google+ 
has low reciprocity, the social degree distribution is best mod- 
eled by a lognormal distribution rather than a power-law distri- 
bution, and the assortativity is neutral rather than positive. 

• Google+ is somewhere between a traditional social network
(e.g., Flickr) and a publisher-subscriber network (e.g., Twitter),
reflecting the hybrid interaction model that it offers. Moreover,
it is moving closer to a publisher-subscriber network over time.

• The evolutionary patterns of various network metrics in Google+
are different from those in many traditional networks and from the
assumptions of various network models. These findings imply
that existing models cannot explain the underlying growth mechanism
of Google+, and we need to design new models for reproducing
social networks similar to Google+.

4. ATTRIBUTE STRUCTURE OF THE 
GOOGLE+ SAN 

In the previous section we looked at well-known social network 
metrics. In this section, we focus on analyzing the attribute struc- 
ture of the Google+ SAN. To this end, we extend the metrics from 
the previous section to the attributes as well. Finally, we show the 
importance of using attributes in understanding the social struc- 
ture by studying their impact on metrics we analyzed earlier (e.g., 
reciprocity, clustering coefficient, and degree distribution). These 
attribute-related studies characterize the attribute structure, give
us insights into the underlying growth mechanism of Google+,
and eventually guide the design of a new generative model for the
Google+ SAN.
Figure 8: Evolution of the attribute density and average attribute
clustering coefficient in the Google+ SAN. (a) Attribute density
(b) Clustering coefficient

Figure 9: Distributions of clustering coefficient with respect to node
degrees. (a) Comparison of social and attribute clustering coefficient
distributions in the original SAN. (b) Comparison of distributions
of attribute clustering coefficients in the original SAN and the
subsampled SAN.



4.1 Attribute Metrics 

Density: We consider a natural extension of the social density
metric from §3.2 and define attribute density as $|E_a|/|V_a|$. In
contrast to our observations for social density in Figure 4b, in Figure 8a
we observe that the attribute density increases rapidly in Phase I, stays
relatively flat in Phase II, and slightly decreases in Phase III. The
reason for the decrease in Phase III is the large volume of new
(i.e., non-invitation) users joining Google+ with many new attribute
nodes whose social degrees are small.

Diameter: We extend the distance metric from §3.3 to define the
attribute distance between two attribute nodes $a$ and $b$ as
$dist(a, b) = \min\{dist(u, v) \mid u \in \Gamma_s(a), v \in \Gamma_s(b)\} + 1$.²
Intuitively, attribute distance is the minimum number of social nodes
that must be traversed to get from one attribute node to the other; i.e.,
attribute distance is the distance between two attribute communities.
Similarly, we can consider the effective diameter using this attribute
distance. Figure 4c also shows the evolution of the attribute diameter,
which very closely mirrors the social diameter.
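
The min-based attribute distance can be computed with a single multi-source breadth-first search started from all users of the first attribute. The following Python sketch is an illustration under stated assumptions, not the implementation used for the measurements; the function name, argument layout, and toy graph are ours, and it returns `None` when no user of the second attribute is reachable.

```python
from collections import deque


def attribute_distance(out_nbrs, users_a, users_b):
    """min over u in users_a, v in users_b of the directed dist(u, v), plus 1.

    out_nbrs: social node -> iterable of out-neighbors (social links only)
    users_a, users_b: social neighbors of attribute nodes a and b
    """
    dist = {u: 0 for u in users_a}
    queue = deque(dist)                  # all users of a start at distance 0
    targets = set(users_b)
    while queue:
        u = queue.popleft()
        if u in targets:
            return dist[u] + 1           # first target dequeued is a closest one
        for v in out_nbrs.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None                          # no user of b is reachable from users of a


# Toy social links u1 -> m -> v1; u1 has attribute a, v1 has attribute b.
out_nbrs = {"u1": ["m"], "m": ["v1"]}
d_ab = attribute_distance(out_nbrs, ["u1"], ["v1"])
d_ba = attribute_distance(out_nbrs, ["v1"], ["u1"])
```

Because the underlying social distance is directed, the attribute distance need not be symmetric; and two attributes that share a user are at distance 1 under this definition.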

Clustering coefficient: Similarly, we generalize the social clustering
coefficient from §3.4 to define the attribute clustering coefficient
$c(u)$ for the attribute node $u$, and the average attribute clustering
coefficient as $C_a = \frac{1}{|V_a|} \sum_{u \in V_a} c(u)$. This attribute
clustering coefficient $c(u)$ characterizes the power of attribute $u$ to
form communities among users who have the attribute $u$. Compared to
Figure 4d, we find in Figure 8b that the average attribute clustering
coefficient evolves in a different pattern, since it is relatively stable
in Phase II.
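
As a worked illustration, the sketch below computes the attribute clustering coefficient as the fraction of ordered pairs of users sharing an attribute that are joined by a social link, and averages it over attribute nodes. It assumes the same normalization as the social clustering coefficient, which is our reading of the generalization; the function names, attribute labels, and links are hypothetical.

```python
def attribute_clustering(users, links):
    """Fraction of ordered pairs of users sharing the attribute that are socially linked.

    users: social neighbors of one attribute node
    links: set of directed social links (u, v)
    """
    users = set(users)
    n = len(users)
    if n < 2:
        return 0.0
    among = sum(1 for (u, v) in links if u in users and v in users and u != v)
    return among / (n * (n - 1))


def average_attribute_clustering(users_of, links):
    """C_a: mean of c(a) over all attribute nodes; users_of maps attribute -> users."""
    return sum(attribute_clustering(us, links) for us in users_of.values()) / len(users_of)


# Hypothetical data: a and b follow each other, a follows c.
links = {("a", "b"), ("b", "a"), ("a", "c")}
users_of = {
    ("Employer", "ExampleCorp"): ["a", "b"],
    ("City", "ExampleCity"): ["a", "b", "c"],
}
c_employer = attribute_clustering(users_of[("Employer", "ExampleCorp")], links)
c_city = attribute_clustering(users_of[("City", "ExampleCity")], links)
c_avg = average_attribute_clustering(users_of, links)
```

In this toy example the smaller employer community is fully linked while the larger city community is only half linked, which is the kind of gap between attribute types that the later analysis examines.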

We also show the distribution of average social and attribute clustering
coefficients as a function of node degree in Figure 9a. We
observe that both social and attribute clustering coefficients follow 
a power-law distribution with respect to node degrees, but attribute 
clustering coefficient distribution has a larger exponent. Moreover, 
we see that in general attribute clustering coefficients are lower be- 
cause many shared attributes (e.g., city or major) will not naturally 
translate into a social relationship. 
Figure 10: Distributions of attribute-induced degrees in the
Google+ SAN along with their best fits. (a) Attribute degree of social
nodes (b) Social degree of attribute nodes. The attribute degree of
social nodes is best modeled by a lognormal whereas the social degree
of attribute nodes is best modeled by a power-law distribution.

Degree distributions: As discussed earlier, SANs introduce edges
between social and attribute nodes. Thus, we consider two new notions
of node degree: (1) social degree of attribute nodes (i.e., the
number of users that have this attribute) and (2) attribute degree
of social nodes (i.e., the number of attributes each user has). We
find that the attribute degree of social nodes is best modeled by a
lognormal distribution whereas the social degree of attribute nodes
is best modeled by a power-law distribution. Figure 10 and Figure 11
show the degree distributions and the evolution of their fitted
parameters.

²Other definitions are possible, e.g., using average instead of min.
We choose min because of its computational efficiency.

Figure 11: Evolution in lognormal and power-law parameters for
the attribute and social degree distributions.

In terms of the evolution, we find that the attribute degree evolution
seen in Figure 11 is significantly different from the previous
observation in Figure 6: its mean decreases in Phase I, remains roughly
constant in Phase II, and decreases again in Phase III. However,
its standard deviation increases slightly in all phases. Finally, for
the social degree of attribute nodes, which follows a power-law
distribution, the exponent decreases quickly in Phase I and increases
slightly in Phase II and Phase III.
Figure 12: (a) Joint degree of attribute nodes: log-log plot of the
social degree versus the average attribute degree of social neighbors
of attribute nodes. (b) Evolution of the attribute assortativity
coefficient.

Joint degree distribution: Next, we extend the joint degree
distribution (JDD) analysis to attribute nodes. For each social degree k,
we compute $k_{nn}$ as the average attribute degree of the social
neighbors of attribute nodes that have social degree k. Intuitively, it
captures the tendency of attribute nodes with high social degree
to connect to social nodes with high attribute degree; i.e., if many
nodes share a particular attribute, are these nodes likely to have
many attributes? Figure 12 shows the $k_{nn}$ function for the attribute
JDD and the evolution of the attribute assortativity. We expect this
relationship to be neutral, and the result confirms this intuition;
e.g., there are many Google+ users in New York, but that does not
imply that people in New York have many attributes. One interesting
observation is that the attribute assortativity coefficient evolves
slightly differently from the social assortativity coefficient
(Figure 7b); it is stable in Phase III whereas social assortativity
decreases significantly.
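The $k_{nn}$ computation for attribute nodes can be sketched as follows. The input layout (a dictionary from user to attribute set) and the function name are assumptions made for illustration. The sketch averages first within each attribute node and then across attribute nodes of equal social degree, which is one of several reasonable readings of the definition.

```python
from collections import defaultdict

def attribute_knn(user_attrs):
    """Map each social degree k to the average attribute degree of the
    social neighbors of attribute nodes with social degree k."""
    # Invert user -> attributes into attribute -> users.
    attr_users = defaultdict(set)
    for user, attrs in user_attrs.items():
        for a in attrs:
            attr_users[a].add(user)
    # For each attribute node, average the attribute degree of its social neighbors.
    per_degree = defaultdict(list)
    for a, users in attr_users.items():
        mean_attr_degree = sum(len(user_attrs[u]) for u in users) / len(users)
        per_degree[len(users)].append(mean_attr_degree)
    # Average over all attribute nodes that share the same social degree.
    return {k: sum(v) / len(v) for k, v in sorted(per_degree.items())}

toy = {"u1": {"NYC", "Google"}, "u2": {"NYC"}, "u3": {"NYC", "Google", "CS"}}
knn = attribute_knn(toy)
```

A flat $k_{nn}$ curve on real data would correspond to the neutral relationship reported above.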

4.2 Influence on Social Network Structure 

Next, we look at how attributes influence the social structure of
the Google+ SAN w.r.t. the metrics discussed in §3.

Figure 13: Influence of attribute on reciprocity and clustering
coefficients.

Reciprocity: We study how the number of common attribute neighbors
influences reciprocity in conjunction with the number of common
social neighbors. Let a and s denote the number of common attribute
and common social neighbors of the two endpoints of a link,
respectively. For each pair (s, a), we compute $r_{s,a}$ as the
percentage of links that are reciprocal among all the links whose
endpoints have s common social neighbors and a common attribute
neighbors.

To compute this, we look at all one-directional links in the snapshot
collected halfway and then compute the number of such links
that become bidirectional in the last snapshot. We split these by the
number of common social and attribute neighbors between these
nodes at the halfway stage and show the $r_{s,a}$ values in Figure 13.
We see that the reciprocity is almost twice as high for nodes that
share common attribute neighbors compared to nodes without common
attributes, regardless of the number of common social neighbors.
While sharing common social neighbors improves link reciprocity,
the returns diminish beyond 10 common social neighbors, and
reciprocity even decreases for much larger values. We speculate that
nodes sharing too many social neighbors are likely users with many
"weak" ties. For the recent reciprocity prediction problem [9, 21],
our findings imply that any reciprocity predictor should incorporate
node attributes rather than rely on pure social structure metrics
alone.
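The measurement procedure above can be sketched in a few lines. The data layout (directed edge lists for the two snapshots plus a user-to-attributes dictionary), the function name, and the choice to treat in- and out-neighbors alike when counting common social neighbors are all assumptions of this sketch.

```python
from collections import defaultdict

def reciprocity_by_common_neighbors(half_edges, final_edges, user_attrs):
    """Return {(s, a): percentage of one-directional halfway links that are
    reciprocated in the final snapshot}."""
    half, final = set(half_edges), set(final_edges)
    # Undirected view of the halfway snapshot for counting common social neighbors.
    nbrs = defaultdict(set)
    for u, v in half:
        nbrs[u].add(v)
        nbrs[v].add(u)
    counts = defaultdict(lambda: [0, 0])  # (s, a) -> [reciprocated, total]
    for u, v in half:
        if (v, u) in half:
            continue  # already bidirectional at the halfway snapshot
        s = len((nbrs[u] & nbrs[v]) - {u, v})
        a = len(user_attrs.get(u, set()) & user_attrs.get(v, set()))
        counts[(s, a)][1] += 1
        if (v, u) in final:
            counts[(s, a)][0] += 1
    return {key: 100.0 * rec / tot for key, (rec, tot) in counts.items()}

# Toy example: users 1 and 2 share attribute "X"; users 3 and 4 share nothing.
half = [(1, 2), (3, 4)]
final = [(1, 2), (2, 1), (3, 4)]
attrs = {1: {"X"}, 2: {"X"}, 3: {"Y"}, 4: {"Z"}}
r = reciprocity_by_common_neighbors(half, final, attrs)
```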

Clustering coefficient: Next, we compute the average attribute
clustering coefficient for the four attribute types: Employer, School,
Major and City. For example, we compute the attribute clustering
coefficients for all attribute nodes belonging to the attribute type
Employer, and then average them to obtain the average attribute
clustering coefficient for Employer. Figure 13b shows that attribute
types vary in their influence on forming communities and that users
with the same Employer attribute are much more likely to form
communities than users sharing other attribute types. This has
interesting implications for link prediction and attribute inference.

Figure 14: Influence of attribute on social degree



Degree distribution: For brevity, we focus only on the Employer
and Major attributes and show the result for the top attribute values
observed within each category. We plot the median, 25th, and 75th
percentiles of the social outdegree of nodes that have these attribute
values in Figure 14. We see that the users with Employer=Google
and Major=Computer Science are likely to have higher degrees.
We also computed the full degree distributions for these attribute
values and saw that they follow different lognormal distributions
(not shown). We speculate this could be a specific artifact of the
Google+ network, as many of the early adopters are likely
Google employees and users in the IT/CS industry.

4.3 Validation via Subsampling 

One natural question is whether the attributes of the 22% of users
that we collected are representative of the attributes of all users.
To this end, we use a subsampling method to validate our
attribute-related results. We use the attribute clustering coefficient
distribution with respect to node degrees as an example, and observe
similar results for other metrics. For each user with attributes, we
remove her attributes with probability 0.5, from which we obtain a
subsampled SAN. Then we calculate the attribute clustering coefficient
distributions for the original and the subsampled SANs. Figure 9b
shows that the results of the original and subsampled SANs are almost
identical. Under the assumption that whether a user fills in her
attributes is a random and independent event, our results demonstrate
that the attributes of 22% of users are a representative sample of the
attributes of all the users.
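The subsampling step can be sketched as follows. For brevity the sketch compares a simpler statistic, the mean attribute degree of users who still have attributes, rather than the attribute clustering coefficient distribution used in the paper. The function names and the synthetic data are hypothetical.

```python
import random

def subsample_attributes(user_attrs, drop_prob=0.5, seed=0):
    """Drop ALL attributes of each user independently with probability drop_prob.

    Users who lose their attributes are omitted from the returned mapping.
    """
    rng = random.Random(seed)
    return {u: a for u, a in user_attrs.items() if a and rng.random() >= drop_prob}

def mean_attribute_degree(user_attrs):
    """Mean number of attributes among users that have at least one."""
    sizes = [len(a) for a in user_attrs.values() if a]
    return sum(sizes) / len(sizes)

# Synthetic stand-in for the crawled data: 10,000 users with 1 to 5 attributes each.
rng = random.Random(1)
original = {u: set(rng.sample(range(50), rng.randint(1, 5))) for u in range(10_000)}
subsampled = subsample_attributes(original)
gap = abs(mean_attribute_degree(original) - mean_attribute_degree(subsampled))
```

If attribute disclosure is random and independent, as assumed above, `gap` should stay small relative to the statistic itself.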

4.4 Summary of Key Observations and Impli- 
cations 

In this section, we studied the attribute structure of the Google+ 
SAN and how such attribute structure impacts the social structure. 
Our key observations are: 

• While some attribute metrics mirror their social counterparts
(e.g., diameter), several show distributions and trends that are
significantly different (e.g., clustering coefficient, attribute
degree). These observations guide the design of our SAN model.

• We confirm that attributes have an interesting impact on the
social structure, e.g., nodes are likely to have higher reciprocity
if they share common attributes. These findings have various
implications. For instance, a reciprocity predictor should
incorporate node attributes.

• We also observe that some attribute types naturally have stronger
influence than others. For example, users sharing the same
employer have a higher probability of being linked than users
sharing the same city. Data mining tasks such as link prediction
and attribute inference could potentially benefit from these
findings.

5. A GENERATIVE MODEL FOR SAN 

From the previous sections, we have seen novel phenomena in
the social and attribute structure of the Google+ SAN and that the
attribute structure impacts the social structure significantly. A
natural question is whether we can create an accurate generative
network model that can reproduce both the social and attribute
structures we observe. Such a generative model can help us understand
the growth mechanism of SANs, and enable other applications such
as network extrapolation and sampling, network visualization and
compression, and network anonymization [30].

Prior work on generative models focuses primarily on the social
structure [5, 24, 2, 33, 29]. Consequently, these approaches cannot
model the attribute structure or its impact on social structure.
To address this gap, we provide a new generative model that takes
the attribute structure into account from first principles rather than
overlaying it after the fact. To this end, we extend a prior generative
model [29], using attribute-augmented models for link generation
and addition, which are key building blocks for such generative
models. As we will show, this provides more realistic synthetic
SANs that closely match the Google+ SAN.

Figure 15: Comparison between Power Attribute Preferential
Attachment (PAPA) and Linear Attribute Preferential Attachment
(LAPA) models. All result numbers are percentages of relative
improvement over the loglikelihood of the PA model, i.e., with
$\alpha = 1$ and $\beta = 0$.

5.1 Building Block 1: Attribute-Augmented Preferential Attachment

Leskovec et al. showed that Preferential Attachment (PA) [5]
is a suitable choice for creating edges [29]. The key idea in PA is
that a new node u connects to an existing node v with
a probability proportional to v's degree. As we saw earlier, users
who share attributes are also more likely to be connected. Thus, we
consider two ways to augment the PA model:

• Power Attribute Preferential Attachment (PAPA):

$f(u,v) \propto d_i(v)^{\alpha} \, (1 + a(u,v)^{\beta})$

• Linear Attribute Preferential Attachment (LAPA):

$f(u,v) \propto d_i(v)^{\alpha} \, (1 + \beta \cdot a(u,v))$

Here, $f(u,v)$ is the probability with which social node u adds
a link to social node v, $d_i(v)$ is the indegree of v, and $a(u,v)$ is
the number of common attributes that social nodes u and v share.³
Notice that when $\alpha = \beta = 0$, both reduce to a uniform
distribution (i.e., v is sampled uniformly at random), and when
$\alpha = 1$, $\beta = 0$, both reduce to the PA model.

The relative improvement of a model with parameters $\alpha, \beta$ over
the PA model is defined as $(l_{PA} - l_{\alpha,\beta}) / l_{PA}$, where $l$
denotes the log-likelihood of the model with respect to the empirically
observed Google+ SAN. Figure 15 shows the relative improvements of
these models over the PA model for varying values of $\alpha, \beta$.
First, LAPA models perform better than PAPA models, which indicates
that attributes likely influence friend requests in a linear way.
Second, the PA model ($\alpha = 1$, $\beta = 0$) is 7.9% better than a
uniform random model ($\alpha = 0$, $\beta = 0$). A LAPA model with
$\alpha = 1$ and $\beta = 200$ achieves a further 6.1% improvement over
the PA model. Third, $\alpha = 1$ achieves the best loglikelihood for any
given $\beta$, which indicates that social degree has a linear effect on
friend requests. In summary, we conclude that there is a combined
linear effect of both social degree and attributes.

3 In a more general setting, we can also weight attribute types
differently; e.g., Employer is stronger than City.
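The two attachment rules can be illustrated with a small sampling sketch. The function names, the dictionary-based inputs, and the default parameter values are assumptions of this sketch; the weights follow the PAPA and LAPA definitions given earlier in this subsection, up to normalization.

```python
import random

def papa_weight(indegree, common, alpha, beta):
    """Unnormalized PAPA weight: d_i(v)^alpha * (1 + a(u,v)^beta)."""
    return indegree ** alpha * (1 + common ** beta)

def lapa_weight(indegree, common, alpha, beta):
    """Unnormalized LAPA weight: d_i(v)^alpha * (1 + beta * a(u,v))."""
    return indegree ** alpha * (1 + beta * common)

def sample_target(u, candidates, indegree, user_attrs, alpha=1.0, beta=200.0, rng=random):
    """Draw one link target for u among candidates according to LAPA."""
    weights = [lapa_weight(indegree[v], len(user_attrs[u] & user_attrs[v]), alpha, beta)
               for v in candidates]
    # All-zero weights are invalid for random.choices, so at least one
    # candidate needs a positive indegree when alpha > 0.
    return rng.choices(candidates, weights=weights, k=1)[0]

# With alpha = 1 and beta = 0 both rules are proportional to the indegree (plain PA).
pa_like = (lapa_weight(3, 2, 1, 0), papa_weight(3, 2, 1, 0))
```

Note that Python evaluates `0 ** 0` as 1, so with both parameters set to zero every candidate receives the same weight, which matches the uniform special case.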

5.2 Building Block 2: Attribute-Augmented Triangle-Closings

Triangle closing, where a node u selects a node v from its 2-hop
neighbors and adds an edge, is an essential part of many generative
network models [29] [M] [33] [2] [43] [53]. We explore whether node
attributes can improve triangle closing.

In the context of SAN, we can consider two types of triangle
closing: one closes a triangle with no attribute node involved
(e.g., $u_4 \to u_2$ in Figure 1), and the other closes a triangle
that includes an attribute node (e.g., $u_1 \to u_2$ in Figure 1).
Following prior work, we refer to them as triadic and focal closure,
respectively [25]. Among the friend requests we observe in Google+,
84% are triadic (common friend), 18% are focal (common attribute),
and 15% are cases where the nodes share both common friends and
common attributes (e.g., uq → us in Figure 1).

This suggests the importance of incorporating attributes in
triangle closure. To this end, we consider three models:

• Baseline: Select a social neighbor v within a 2-hop radius
uniformly at random.

• Random-Random (RR): Select a social neighbor $w \in \Gamma_s(u)$
uniformly at random, and then select a social neighbor
$v \in \Gamma_s(w)$ uniformly at random; this model was shown to
perform very well in previous work [29].

• Random-Random-SAN (RR-SAN): Select a neighbor
$w \in \Gamma_s(u) \cup \Gamma_a(u)$ uniformly at random, and then
select a social neighbor $v \in \Gamma_s(w)$ uniformly at random.⁴

We compare these models using friend requests that are triadic
closures, focal closures, or both. Our experimental results confirm
that the RR model performs 14% better than the Baseline model [29],
and our RR-SAN model performs 36% better than the RR model. This
confirms that attributes play a significant role in the
triangle-closing phenomenon as well, with natural implications for
applications such as link prediction and friend recommendation.
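The RR-SAN rule can be sketched as a two-step random walk whose first hop may land on an attribute node. The adjacency layout (plain dictionaries of lists), the function name, and the decision to exclude u itself from the second hop are assumptions of this sketch.

```python
import random

def rr_san_target(u, social_nbrs, attr_nbrs, attr_members, rng=random):
    """Pick a link target for u via RR-SAN.

    social_nbrs:  social node -> list of social neighbors
    attr_nbrs:    social node -> list of its attribute nodes
    attr_members: attribute node -> list of social nodes having that attribute
    """
    # First hop: uniform over social and attribute neighbors of u together.
    first_hops = ([("social", w) for w in social_nbrs.get(u, ())]
                  + [("attr", a) for a in attr_nbrs.get(u, ())])
    if not first_hops:
        return None
    kind, w = rng.choice(first_hops)
    # Second hop: uniform over the social neighbors of w.
    pool = social_nbrs.get(w, ()) if kind == "social" else attr_members.get(w, ())
    pool = [v for v in pool if v != u]
    return rng.choice(pool) if pool else None

# Toy SAN: node 1 follows node 2 (who follows 3) and shares attribute "X" with node 4.
social = {1: [2], 2: [3]}
attr_of = {1: ["X"]}
members = {"X": [1, 4]}
picks = {rr_san_target(1, social, attr_of, members, random.Random(i)) for i in range(200)}
```

In the toy example node 3 is reachable only through triadic closure and node 4 only through focal closure, so dropping the attribute hop (classical RR) would never propose node 4.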

5.3 Our Generative Model for SAN 



4 We also tried a weighted model where we select neighbors
proportional to link weights. For brevity, we do not show this because
it performs similarly.



Our stochastic process models several key aspects of SAN
evolution: node joining, how nodes issue outgoing links and receive
incoming links, and how they link to attribute nodes. The key
differences from prior work [29] are the two building blocks we
described earlier: Linear Attribute Preferential Attachment (LAPA)
and Random-Random-SAN (RR-SAN) triangle-closing.

Here, nodes arrive at some pre-determined rate. On arrival, each
node picks an initial set of attributes and social neighbors (using the
LAPA model). After joining the network, each node subsequently
"sleeps" for some time, wakes up, and adds new links based on the
RR-SAN model. We describe the model formally in Algorithm 1
and discuss each step next. From the analysis below, we find that
the key step for generating a lognormal social outdegree distribution
is to make the lifetime of nodes follow a truncated normal
distribution.



Initialization: The SAN is initialized with a few social and at- 
tribute nodes and links. We observed that the starting point has 
no detectable influence when the number of initialization nodes is 
small compared to the overall network. We currently use a com- 
plete social-attribute network with 5 social nodes and 5 attribute 
nodes. 

Social node arrival: Social nodes arrive as predicted by a node 
arrival function N(t), which could be estimated from real social 
networks. In our simulations, we simply let N(t) = 1 modeling 
each node arrival as a discrete time step. 

Attribute degree: Each node picks some number of attributes
sampled from a lognormal distribution with mean $\mu_a$ and variance
$\sigma_a^2$.

Attribute linking: For each new social node $v_{new}$ with
$n_a(v_{new})$ attributes, we connect it to $n_a(v_{new})$ attribute nodes
using the following stochastic process: for each attribute, with
probability p, a new attribute node a is generated; otherwise an
existing attribute node a is chosen with probability proportional to
its social degree.

First outgoing links: Each new node issues an outgoing link to a 
social node according to the LAPA model. 

Lifetime sampling: The lifetime $l$ of $v_{new}$ is sampled from a
truncated normal distribution, i.e.,
$p(l) \propto \exp\left(-\frac{(l-\mu_l)^2}{2\sigma_l^2}\right)$ for $l \ge 0$.
(Prior models use an exponentially distributed lifetime value
[29, 61].)

Sleep time sampling: The sleep time $s$ of any node v with outdegree
$d_o$ can be sampled from any distribution with mean $m_s/d_o$. Our
model depends only on the mean sleep time. The intuition behind
making the mean sleep time inversely proportional to outdegree is that
a node with larger outdegree has a higher tendency to issue outgoing
links. (Prior models assume a sleep time that follows a power-law
distribution with cutoff [29, 61].)

Outgoing linking: Each woken social node $v_{woken}$ issues a new
outgoing link according to our RR-SAN triangle-closing model.
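The steps above can be put together in a compact simulation sketch. It is a toy illustration under several assumptions, not the authors' implementation: sleep times are exponential (any distribution with mean $m_s/d_o$ is allowed by the model), social neighbors are taken to be out-neighbors, the initial nodes never wake up, duplicate links are silently skipped, and all default parameter values are arbitrary.

```python
import heapq
import random

def generate_san(n_nodes, mu_a=0.5, sigma_a=0.5, p=0.2, mu_l=10.0, sigma_l=5.0,
                 m_s=3.0, alpha=1.0, beta=200.0, seed=0):
    """Toy SAN generator; returns (out_links, user_attrs, attr_members)."""
    rng = random.Random(seed)
    # Initialization: complete SAN with 5 social nodes and 5 attribute nodes.
    out = {u: {v for v in range(5) if v != u} for u in range(5)}
    indeg = {u: 4 for u in range(5)}
    attrs = {u: set(range(5)) for u in range(5)}
    members = {a: list(range(5)) for a in range(5)}
    wakeups = []  # heap of (wake time, node, end of lifetime)

    def add_link(u, v):
        if v != u and v not in out[u]:
            out[u].add(v)
            indeg[v] += 1

    def schedule(u, now, death):
        # Exponential sleep with mean m_s / outdegree.
        wake = now + rng.expovariate(len(out[u]) / m_s)
        if wake <= death:
            heapq.heappush(wakeups, (wake, u, death))

    for t in range(5, n_nodes):  # N(t) = 1: one arrival per discrete step
        # Outgoing linking: nodes that woke up before this arrival use RR-SAN.
        while wakeups and wakeups[0][0] <= t:
            wake, u, death = heapq.heappop(wakeups)
            hops = [("s", w) for w in sorted(out[u])] + [("a", a) for a in sorted(attrs[u])]
            kind, w = rng.choice(hops)
            pool = sorted(out[w]) if kind == "s" else members[w]
            add_link(u, rng.choice(pool))
            schedule(u, wake, death)
        # Attribute degree (lognormal) and attribute linking.
        attrs[t] = set()
        for _ in range(max(1, round(rng.lognormvariate(mu_a, sigma_a)))):
            if rng.random() < p:
                a = len(members)   # create a new attribute node
                members[a] = []
            else:                  # existing node, proportional to social degree
                ids = list(members)
                a = rng.choices(ids, weights=[len(members[i]) for i in ids])[0]
            if a not in attrs[t]:
                attrs[t].add(a)
                members[a].append(t)
        # First outgoing link via LAPA.
        existing = list(range(t))
        weights = [indeg[v] ** alpha * (1 + beta * len(attrs[t] & attrs[v]))
                   for v in existing]
        out[t], indeg[t] = set(), 0
        add_link(t, rng.choices(existing, weights=weights)[0])
        # Lifetime: normal truncated to l >= 0, via rejection sampling.
        life = rng.gauss(mu_l, sigma_l)
        while life < 0:
            life = rng.gauss(mu_l, sigma_l)
        schedule(t, t, t + life)
    return out, attrs, members

out_links, user_attrs, attr_members = generate_san(300, seed=1)
```

The LAPA step scans all existing nodes, so this sketch takes quadratic time overall and is only suitable for small networks.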

5.4 Theoretical Analysis 

By design, the attribute degree distribution of social nodes fol- 
lows a lognormal distribution. Next, we show via analysis that the 
outdegree of social nodes and the social degree of attribute nodes 
follow a lognormal and power-law distribution respectively. For 
brevity, we provide a high-level sketch of the proofs. 

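Let $\phi(x)$ and $\Phi(x)$ denote the probability density function and
cumulative distribution function of the standard normal distribution.
Let $\gamma_l = -\frac{\mu_l}{\sigma_l}$,
$g(\gamma_l) = \frac{\phi(\gamma_l)}{1 - \Phi(\gamma_l)}$ and
$\delta(\gamma_l) = g(\gamma_l)\,(g(\gamma_l) - \gamma_l)$.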



THEOREM 1. If the sleep time is sampled from some distribution
with mean $m_s/d_o$, then the social outdegrees of SANs generated
by our model follow a lognormal distribution with mean
$(\mu_l + \sigma_l g(\gamma_l))/m_s$ and variance
$\sigma_l^2 (1 - \delta(\gamma_l))/m_s^2$.

PROOF. For any social node v, assume its final outdegree is $D_o$;
then we have

$$\sum_{d_o=1}^{D_o} s(d_o) \le l,$$

where $s(d_o)$ is the random sleep time whose mean is $m_s/d_o$. Thus,
with mean-field approximation, we obtain

$$\sum_{d_o=1}^{D_o} \frac{m_s}{d_o} \le l.$$

Moreover, according to Euler's asymptotic analysis on harmonic
series, we have

$$\sum_{d_o=1}^{D_o} \frac{1}{d_o} \approx \ln D_o.$$

That is, $\ln D_o \approx l/m_s$. Lifetime $l$ follows a normal
distribution truncated for $l \ge 0$, thus having mean
$\mu_l + \sigma_l g(\gamma_l)$ and variance $\sigma_l^2 (1 - \delta(\gamma_l))$.
Thus, $\ln D_o$ follows a truncated normal distribution with mean
$\mu_o = (\mu_l + \sigma_l g(\gamma_l))/m_s$ and variance
$\sigma_o^2 = \sigma_l^2 (1 - \delta(\gamma_l))/m_s^2$. So $D_o$ follows a
lognormal distribution with mean $\mu_o$ and variance $\sigma_o^2$. □
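The closed-form moments used in Theorem 1 are easy to evaluate numerically. The sketch below computes the mean and variance of a normal lifetime truncated to $l \ge 0$ and the resulting lognormal parameters of the outdegree. The function names are hypothetical, and the formulas are the standard truncated-normal moments with $g = \phi/(1-\Phi)$ evaluated at $-\mu_l/\sigma_l$.

```python
from statistics import NormalDist

def truncated_normal_moments(mu_l, sigma_l):
    """Mean and variance of N(mu_l, sigma_l^2) truncated to l >= 0."""
    std = NormalDist()                       # standard normal: pdf = phi, cdf = Phi
    gamma = -mu_l / sigma_l
    g = std.pdf(gamma) / (1.0 - std.cdf(gamma))
    delta = g * (g - gamma)
    return mu_l + sigma_l * g, sigma_l ** 2 * (1.0 - delta)

def outdegree_lognormal_params(mu_l, sigma_l, m_s):
    """Predicted mean and variance of ln(outdegree), using ln D ~ l / m_s."""
    mean_l, var_l = truncated_normal_moments(mu_l, sigma_l)
    return mean_l / m_s, var_l / m_s ** 2

# Sanity check on a half-normal (mu_l = 0, sigma_l = 1):
# the mean is sqrt(2/pi) and the variance is 1 - 2/pi.
half_mean, half_var = truncated_normal_moments(0.0, 1.0)
mu_o, var_o = outdegree_lognormal_params(0.0, 1.0, 2.0)
```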

Next, we derive the distribution of social degree of attribute nodes
using mean-field rate equations [3].

THEOREM 2. The social degrees of attribute nodes in the SANs
generated by our model follow a power-law distribution with exponent
$\frac{2-p}{1-p}$.

PROOF. Without loss of generality, we assume one attribute link
joins the SAN at each discrete time step. Let $D_i$ denote the social
degree of the attribute node i that joins the network at time $t_i$.
According to the stochastic process in our algorithm, we have

$$\frac{dD_i}{dt} = \frac{(1-p)\,D_i}{\sum_j D_j} = \frac{(1-p)\,D_i}{t + m_0},$$

where $m_0$ is the initial number of attribute links. Solving this
ordinary differential equation with initial condition $D_i = 1$ at
$t = t_i$ gives us

$$D_i = \left(\frac{t + m_0}{t_i + m_0}\right)^{1-p}.$$

So the probability of $D_i < D$ is

$$\Pr(D_i < D) = 1 - \Pr\left(t_i + m_0 \le (t + m_0)\,D^{-\frac{1}{1-p}}\right).$$

According to our model, $t_i$ has a uniform distribution over the
set $\{1, 2, \ldots, t\}$. Thus we obtain

$$\Pr(D_i < D) = 1 - \frac{(t + m_0)\,D^{-\frac{1}{1-p}} - m_0}{t}.$$

Then the distribution of $D_i$ can be calculated as

$$\Pr(D) = \frac{d\Pr(D_i < D)}{dD} = \frac{t + m_0}{t(1-p)}\,D^{-\frac{2-p}{1-p}}.$$

As $t \to \infty$, we obtain $\Pr(D) \propto D^{-\frac{2-p}{1-p}}$. So the
social degrees of attribute nodes follow a power-law distribution with
exponent $\frac{2-p}{1-p}$. □
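Theorem 2 can be checked empirically by simulating only the attribute-linking step. The sketch below assumes an initial state of `m0` attribute nodes with one link each; the function names and parameter values are hypothetical. Keeping one list entry per attribute link makes a uniform draw from the list equivalent to choosing an attribute node with probability proportional to its social degree.

```python
import random
from collections import Counter

def predicted_exponent(p):
    """Power-law exponent (2 - p) / (1 - p) from Theorem 2."""
    return (2.0 - p) / (1.0 - p)

def simulate_attribute_degrees(steps, p, m0=5, seed=0):
    """Simulate attribute linking; returns attribute node -> social degree."""
    rng = random.Random(seed)
    link_ends = list(range(m0))   # assumed start: m0 attribute nodes, one link each
    next_id = m0
    for _ in range(steps):
        if rng.random() < p:
            link_ends.append(next_id)               # new attribute node
            next_id += 1
        else:
            link_ends.append(rng.choice(link_ends))  # proportional to social degree
    return Counter(link_ends)

def ccdf(degrees, k):
    """Fraction of attribute nodes with social degree >= k."""
    return sum(1 for d in degrees.values() if d >= k) / len(degrees)

degrees = simulate_attribute_degrees(steps=100_000, p=0.5)
ratio = ccdf(degrees, 20) / ccdf(degrees, 10)
```

For a density that decays as $D^{-x}$, the tail fraction decays roughly as $D^{-(x-1)}$, so doubling the threshold should shrink it by about $2^{-(x-1)}$. With p = 0.5 the predicted exponent is 3, and `ratio` should therefore be in the vicinity of 0.25, up to discreteness and finite-size effects.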
Figure 16: Degree distributions of synthetically generated SAN using our model in (a)-(d) vs. Zhel shown in (e)-(h).

Mitzenmacher [40] did a comprehensive study on generative models
(e.g., PA, multiplicative models, random monkey) for power-law
and lognormal distributions. In this work, we have proposed
two new generative models.



6. EVALUATION 

In this section, we validate our SAN generative model. Because
the SAN area is still nascent, there are few standard models
for comparison. We pick the closest generative model, by Zheleva
et al. [61]. Note that their model is actually orthogonal to ours since
it models dynamic node attributes while ours models static
node attributes. Furthermore, their original model generates
undirected social networks. In order to compare with our model and
the directed Google+ SANs, we extend their model to generate directed
social networks.⁵ We refer to the extended model as the Zhel model
throughout this section. We start with network metrics, including
single-node degree distribution, joint degree distribution and
clustering coefficient. Then, following the spirit of [43], we also
evaluate our model using real application contexts.

For comparison, we use the Google+ snapshot crawled on July
15, 2011, which has roughly 10 million nodes and which we believe is
representative of the Google+ SAN. Using this Google+ snapshot, we
run a guided greedy search to estimate appropriate parameters for
our model and Zhel to generate synthetic SANs that best match
Google+.

6.1 Network Metrics 

In this section, we qualitatively compare our model to the Zhel
model, and demonstrate that our model generates synthetic SANs
whose network metrics are closer to those of the Google+ SAN.

Degree distributions: We first examine the degree distributions of 
the synthetic SAN generated by our model and the Zhel model in 
Figure 17: Joint degree and clustering coefficient distributions of
our model (a)-(b) vs. Zhel in (c)-(d).

5 Extending their model is straightforward. For instance, when the
original model issues an undirected link, we change it to be a
directed outgoing link.



Figure 16. The most visually evident result from Figure 16a
and Figure 16b is that our model can generate synthetic networks
with social indegree and outdegree following lognormal distributions
similar to the Google+ SAN that we saw in Figure 5. In contrast,
Figure 16f and Figure 16e confirm that the Zhel model generates
indegree and outdegree following power-law distributions.
Similarly, comparing Figure 16c and 16g to Figure 10a, the
attribute degree of social nodes in our model follows the lognormal
distribution that matches that of the Google+ SAN, whereas
the Zhel model generates attribute degrees that follow a power-law
distribution. Finally, Figure 16d and 16h confirm that both our
Figure 18: The effect of LAPA and focal closure. (a) Social indegree
w/o LAPA. (b) Clustering coefficient w/o focal closure.

model and Zhel generate social degrees of attribute nodes that
follow a power-law distribution, which is again consistent with the
Google+ SAN from Figure 10b.

Joint degree distributions: The ability to mirror more fine-grained
properties beyond the degree distributions has been shown to be a
key metric for evaluating generative models [37]. Thus, we look
at the joint degree distribution approximated by the degree
correlation function $k_{nn}$ in Figure 17a and 17c for our model and
Zhel. Compared to Figure 12, we see that the JDD of attribute nodes in
the SAN generated by our model matches the Google+ SAN much better
than Zhel. We observe a similar pattern for the JDD of social nodes.

Clustering coefficient: Fig. 17b and Fig. 17d show the clustering
coefficient distributions of synthetic SANs generated by our model
and Zhel, respectively. When comparing them to Fig. 9a, we see
that our model generates synthetic SANs with both social and
attribute clustering coefficient distributions matching those of the
Google+ SAN well, which is not the case for Zhel.

Significance of building blocks: Recall that our model has two
key building blocks: one extends preferential attachment via LAPA,
and the other extends triangle closing via focal closure. A natural
question is what each of these components contributes toward the
overall generative model.

First, we investigate how LAPA impacts the structure of the
generated SAN in our model. To this end, we consider an intermediate
model with the classical PA (but with RR-SAN enabled)
and compute the previous metrics for SANs generated by this
intermediate model. We find that all metrics except the distribution
of social indegree are qualitatively the same. Figure 18a shows
that the distribution of social indegree of the synthetic SAN
generated by our intermediate model is very close to a power-law
distribution, different from the lognormal distribution generated by
our full model shown in Figure 16b and derived from the real Google+
SAN shown in Figure 5. This suggests that the LAPA component
is necessary for modeling a key aspect of the Google+ SAN.

Second, we investigate the impact of RR-SAN. The key metric
impacted by the focal closure component of RR-SAN is the
attribute clustering coefficient. Figure 18b shows the social and
attribute clustering coefficients of synthetic SANs generated by our
model without RR-SAN (with classical RR enabled). Looking at
Figure 17b and Figure 18b together, we see that RR-SAN has a
significant impact on the attribute clustering coefficient.

These results confirm that both attribute-augmented building blocks,
LAPA and RR-SAN, play important but complementary roles in
our model in generating synthetic SANs that closely mirror the real
Google+ SAN.

6.2 Application Fidelity 

Next, we use two real-world application contexts to evaluate the
fidelity of our generative model and the Zhel model with respect to
a real Google+ snapshot. In each case, we use the metric of interest
relevant to each application. Note that all these applications rely
only on the social structure.

Figure 19: Application fidelity of our model. (a) Sybil defense:
SybilLimit false negatives as a function of number of compromised
nodes. (b) Social network based anonymity: Probability of end-to-end
timing analysis as a function of number of compromised nodes.

Sybil defense: In a Sybil attack [14], a single entity emulates the
behavior of a large number of identities to compromise the security
and privacy properties of a system. Sybil attacks are of particular
concern in decentralized systems, which lack mechanisms
to vet identities and perform admission control. Several recent
works have proposed the use of social trust relationships to mitigate
Sybil attacks [14, 59]. Next, we show the fidelity of our model
using a representative social network based Sybil defense mechanism
called SybilLimit [59].

In order to prevent an adversary from obtaining a large number
of attack edges (edges between compromised and honest users),
SybilLimit bounds the effective node degree in the social network
topology. Following their guidelines, we also imposed a node degree
bound of 100 in evaluating their proposal on the different
SANs. Figure 19a depicts the number of Sybil identities that an
adversary can insert, as a function of the number of compromised nodes
in the network. We compromised the nodes uniformly at random,
and set the SybilLimit parameter $w = 10$. The parameter $f_c$ governs
the attribute link weight in our RR-SAN component; $f_c = 0$
means no focal closure.

We can see that (a) SybilLimit results using the synthetic topology
generated by our model are a close match to the real Google+
data, and (b) our model outperforms the baseline approach (Zhel
model). For example, when the number of compromised nodes
is 200,000, the average number of Sybil identities in the Google+
topology is about 25.3 million, while our model predicts 24.5 million
(error of 3.1% using $f_c = 0.1$). In contrast, the baseline approach
has an almost 4x worse prediction error of 12.5%.
This shows the importance of using attribute information to
influence the social structure (the Zhel model only uses
the social structure to influence the attribute structure).

Anonymous communication: Anonymous communication aims
to hide user identity (IP address) from the recipient (destination)
or from third parties on the Internet such as autonomous systems.
The Tor network [12] is a deployed system for anonymous
communication that serves hundreds of thousands of users a day. It
is widely used by political dissidents, journalists, whistle-blowers,
and even law enforcement/military. Recent work [22, 11] has proposed
leveraging social links in building anonymous paths for improving
resistance to attackers. For example, the Drac [11] system
selects proxies (onion routers) by performing a random walk on the
social network. For low-latency communications, if the first and the
last hops of the forwarding path (onion routing circuit) are
compromised, then the adversary can perform end-to-end timing analysis
and break user anonymity. Figure 19b depicts the probability of
end-to-end timing analysis when random walks on social networks
are performed for anonymous communication, using the Google+
social network and our synthetic network. Similar to our SybilLimit
experiments, we compromise nodes uniformly at random in
the network, and impose an upper bound of 100 on the node degree.
Again, we can see the accuracy of our model, as well as the
improvement over prior work.

6.3 Summary 

Via evaluating our model with respect to network metrics and 
real-world applications, we find that: 

• Our model can reproduce SANs that match the Google+ SAN well
with respect to various network metrics (e.g., degree distributions,
joint degree distributions and clustering coefficients), but
the Zhel model cannot match several metrics (e.g., social degree
distributions, joint degree distributions and clustering
coefficients).

• Our model also performs better than the Zhel model for real- 
world applications such as Sybil defense and anonymous com- 
munication. 

• The two attribute-augmented building blocks, i.e., LAPA and 
RR-SAN, play important but complementary roles in our model. 

7. DISCUSSION 

Using attributes to strengthen defenses: Our evaluation largely 
focuses on how our model better matches the real-world SAN. We 
hypothesize that several attack defenses (e.g., Sybil proofing) can 
also be enhanced by taking into account the attribute structure. For 
example, we could check if the attribute structure of the nodes 
matches normal nodes, or even if an attacker manages to obtain 
a "compromised" edge to one node we can limit the influence of 
this compromised edge by checking the attribute structure. 

LAPA Computation: The LAPA model as described requires a 
costly linear time (in number of nodes) step when a new node ar- 
rives. This is because we have to consider the number of common 
attributes between the new node and each current node, unlike PA 
which only needs the global degree distribution. Fortunately, we 
can approximate LAPA using a practical heuristic. The high-level 
idea is to pick one of the new node's attributes at random and use 
PA within the nodes having this attribute. This approximates LAPA 
as nodes sharing more attributes are more likely to get selected. 
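The heuristic can be sketched as follows. The function name and data layout are hypothetical, and returning `None` when no eligible node exists is a choice of this sketch rather than part of the description above.

```python
import random

def approx_lapa_target(new_attrs, attr_members, indegree, rng=random):
    """Approximate LAPA: pick one attribute of the new node uniformly at random,
    then run plain PA (proportional to indegree) among nodes with that attribute."""
    if not new_attrs:
        return None
    attr = rng.choice(sorted(new_attrs))
    # Only nodes with positive indegree can be drawn under PA.
    pool = [v for v in attr_members.get(attr, []) if indegree.get(v, 0) > 0]
    if not pool:
        return None
    return rng.choices(pool, weights=[indegree[v] for v in pool], k=1)[0]

attr_members = {"X": ["a", "b"], "Y": ["c"]}
indegree = {"a": 3, "b": 0, "c": 10}
target = approx_lapa_target({"X"}, attr_members, indegree, random.Random(0))
```

Each call only touches the members of one attribute node, which avoids the linear scan over all existing nodes. A node that shares several attributes with the newcomer can be reached through any of them, which is why it is selected more often.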

Dynamic attributes: Our model currently focuses on static
attributes that nodes pick when they join the SAN. In our future work,
we plan to incorporate dynamic attributes, and investigate whether
the static attribute structure also influences the selection of dynamic
attributes. Note that static attributes influence the social structure in
our model while the dynamic attributes are influenced by the social
structure in the model from Zheleva et al. [61].

Parameter inference: We currently use a guided greedy search
to empirically estimate model parameters. While this works quite
well, we plan to develop a more rigorous parameter inference
algorithm based on the maximum-likelihood principle [31, 57].

Parsimoniousness of our model: In §6, we have shown that each
component of our model is necessary. However, designing a more
parsimonious model is an interesting direction for future work.

Implications for social network designs: Our result that users
sharing common employer attributes are more likely to be linked
than users sharing other attributes can help design a better friend
recommendation system, which is a fundamental component
of online social networks.

Relationship to heterogeneous networks: Our SAN can be viewed
as a heterogeneous network since it consists of multiple types of
nodes and links. Heterogeneous networks have recently been shown to
work better than traditional homogeneous networks for various data
mining tasks such as link prediction [17, 58, 48], attribute
inference [17, 58] and community detection [49]. It is an
interesting direction for future work to generalize our new
attribute-related metrics and generative model to other heterogeneous
networks.

8. RELATED WORK 

Given the growing role of social networks in users' lives and 
the potential for using such insights for building better systems 
and applications, there is a rich literature on measuring and mod- 
eling social networks. Next, we discuss our work in the context of 
this related work. At a high level, our specific new contributions
are: (1) we characterize the evolution of a new large-scale network 
(Google+), and (2) we provide measurement-driven insights and 
models on the impact of attributes on social network evolution. 

Measuring social networks: Many prior efforts characterize social
networks using the network metrics we also describe in §3
[26, 28, 29, 38, 25]. Most of these focus on static snapshots; a few
notable works also focus on evolutionary aspects similar to our work
[3, 56, 4]. With multiple Google+ snapshots crawled around its public
release, Schioberg et al. [46] studied a few network metrics, the
geographic distribution of the users and links, and the correlation of
users' public information on Google+.

Concurrently, Gonzalez et al. [18] characterize several key features
of Google+ during its first 10 months, and compare them to
those of Facebook and Twitter. Using a static Google+ snapshot 
crawled after its public release, Magno et al. [36] identify the key
differences between Google+ and Facebook and Twitter, study the 
adoption patterns of Google+ in different countries, and character- 
ize the variation of privacy concerns across different cultures. Zhao 
et al. [60] study the early evolution of the Renren social network,
and analyze its network dynamics at different granularities to de- 
termine their influence on individual users. While we follow the 
spirit of these works, our work is unique in terms of the specific 
dataset (i.e., three phases of Google+), the scale of the network, 
and the fact that we had a singular opportunity to study the evolu- 
tion across different phases. 

There has been a recent realization of the importance of user
attributes in characterizing social networks [38, 61]. These works focus on
the influence of social structure on dynamic node attributes (e.g., 
interest groups). Our work focuses on the orthogonal dimension of 
analyzing and modeling the influence of static node attributes on 
social structure formation using Google+. 

Modeling social networks: There are two broad classes of models
for generating social networks: static and dynamic. Static models
try to reproduce a single static network snapshot [15, 55, 37, 47].
Dynamic models can provide insights on how nodes arrive and create
links; these include models such as preferential attachment [5],
copying [24], nearest neighbor [2], and forest fire [33]. Sala et al.
[43] evaluated such models using both network metrics and application
benchmarks and showed that the nearest neighbor model outperforms
the others. The dynamic/generative model by Leskovec et al.
mimics the nearest neighbor model in a dynamic setting [29], and
thus we use it as our starting point in §5. However, these models are
known to generate networks with power-law degree distributions.
Many social networks including Google+, however, exhibit lognormal
degree distributions [16, 32, 34]. Our dynamic model extends
this prior work to provably generate a lognormal distribution for
social outdegree. Our model also provides a more general framework
by capturing both social and attribute structure.

Modeling social-attribute networks: There has been relatively
little work on generating SANs, though a few recent works that
jointly generate both social structure and node attributes can be
viewed as SAN models; the most relevant are from Zheleva et al. [61]
and Kim and Leskovec [23]. Zheleva et al. [61] focus on dynamic
attributes; their model generates undirected networks with a
power-law distribution for social degree and a non-lognormal
distribution for attribute degree (see Figure 16). Kim and Leskovec
model the social and attribute structure simultaneously [23]. Here,
both the social degrees of attribute nodes and the attribute degrees
of social nodes follow binomial distributions, which differs from
empirically observed SANs. Our model can generate SANs that we
confirm through both analysis and simulations to be consistent with
real SANs.

9. CONCLUSION 

Using a unique dataset collected by crawling Google+ since its
launch in June 2011, we provide a first-principled understanding of
the attribute structure and its impact on the social structure and their
evolutions with the SAN model. We observe several interesting
phenomena in the structure and evolution of Google+. For example,
the social degree distributions are lognormal, the assortativity
is neutral while many other social networks have positive
assortativities, and the distinct phases in the evolution manifest themselves
in the network structure. We also provide new metrics for charac- 
terizing the attribute structure and demonstrate that attributes can 
significantly impact the social structure. Building on these empir- 
ical insights, we provide a new generative model for SANs and 
validate that it is close to the real Google+ SAN using both net- 
work metrics and real application contexts. We believe that our 
work is one of the first steps in this regard and that there are several 
interesting directions for future work to harness the power of us- 
ing the attribute structure for designing better social network based 
systems and applications. 
APPENDIX

A. A CONSTANT TIME ALGORITHM FOR
APPROXIMATING CLUSTERING COEFFICIENTS

Before going into the details of the algorithm and its analysis, we
introduce some notation. In both directed and undirected SANs, a
triple $t$ consists of three nodes $(v, u, w)$ satisfying
$v, w \in \Gamma_s(u)$, where $u$ is called the center and $v, w$ are
called the endpoints of $t$. Moreover, $\alpha_t$ and $\beta_t$ denote
respectively the center node and the two endpoints of $t$.

Algorithm 2: Constant Time Approximate Algorithm for Computing the
Average Clustering Coefficient

Input: (SAN, $\Omega$, $K$), where SAN $= (V_s, V_a, E_s, E_a)$,
$\Omega$ is the set of nodes whose average clustering coefficient
$C_\Omega$ is approximated and $K$ is the number of samples needed.

Output: Approximate average clustering coefficient $\tilde{C}_\Omega$.

 1 begin
 2     $L \leftarrow 0$
 3     $k \leftarrow 0$
 4     while $k < K$ do
 5         $k \leftarrow k + 1$
 6         Sample a node $u$ uniformly at random from $\Omega$
 7         Sample a pair of nodes $v$ and $w$ uniformly at random from
           $u$'s social neighbors $\Gamma_s(u)$
 8         $L \leftarrow L + F(v, u, w)$
 9     end
10     $\tilde{C}_\Omega \leftarrow L / (2^I K)$
11 end

For a directed SAN and a set of triples $T$, we define a mapping
$F : T \to \{0, 1, 2\}$, where $F(t = (v, u, w)) = 0$ if $v$ and $w$
are not connected, $F(t = (v, u, w)) = 1$ if they are connected
by one directed link and $F(t = (v, u, w)) = 2$ if they are
reciprocally linked. For an undirected SAN, the mapping is defined as
$F : T \to \{0, 1\}$, where $F(t = (v, u, w)) = 0$ if $v$ and $w$ are
not connected, otherwise $F(t = (v, u, w)) = 1$. Let $I$ be an
indicator variable of the directedness of a SAN, where $I = 0$ when
the SAN is undirected, otherwise $I = 1$. With the indicator
variable $I$, we have $0 \le F(t) \le 2^I$, which is useful for
deriving the approximation bounds that follow.
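
As an illustration only, Algorithm 2 and the mapping F can be sketched in Python as below. The data layout is an assumption and not part of the paper: `links` maps each node to the set of nodes it links to (symmetric for an undirected SAN), and `neighbors` maps each node to a list holding its social neighbors. The sketch also assumes that a sampled center with fewer than two social neighbors contributes 0, which the text does not specify. The exact computation is included only as a reference for small inputs.

```python
import random
from itertools import combinations


def triple_score(links, v, w, directed):
    """Mapping F: count the links between the endpoints v and w of a triple."""
    forward = w in links.get(v, ())
    backward = v in links.get(w, ())
    if directed:
        return int(forward) + int(backward)  # 0, 1 or 2
    return int(forward or backward)          # 0 or 1


def approx_avg_clustering(links, neighbors, omega, num_samples, directed, rng=None):
    """Sampling estimate of the average clustering coefficient over omega."""
    rng = rng or random.Random()
    omega = list(omega)
    indicator = 1 if directed else 0  # the indicator I from the text
    total = 0                         # the accumulator L in Algorithm 2
    for _ in range(num_samples):
        u = rng.choice(omega)
        nbrs = neighbors[u]
        if len(nbrs) < 2:
            continue  # assumption: no triple is centered here, so it adds 0
        v, w = rng.sample(nbrs, 2)
        total += triple_score(links, v, w, directed)
    return total / (2 ** indicator * num_samples)


def exact_avg_clustering(links, neighbors, omega, directed):
    """Exact average over omega, used only as a reference on small inputs."""
    omega = list(omega)
    indicator = 1 if directed else 0
    total = 0.0
    for u in omega:
        nbrs = neighbors[u]
        num_triples = len(nbrs) * (len(nbrs) - 1) // 2
        if num_triples == 0:
            continue
        score = sum(triple_score(links, v, w, directed)
                    for v, w in combinations(nbrs, 2))
        total += score / (2 ** indicator * num_triples)
    return total / len(omega)
```

The running time of `approx_avg_clustering` depends on the number of samples and not on the size of the SAN, provided that membership tests and neighbor lookups take constant time, which holds on average for Python sets and dictionaries.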

For any set of nodes $\Omega$, their average clustering coefficient
can be represented as
$C_\Omega = \frac{1}{|\Omega|} \sum_{u \in \Omega} c(u) = 2^{-I} \sum_{t \in T_\Omega} \frac{1}{|\Omega| \tau(\alpha_t)} F(t)$,
where $T_\Omega = \{t \mid \alpha_t \in \Omega\}$ and
$\tau(\alpha_t) = \frac{1}{2} |\Gamma_s(\alpha_t)| (|\Gamma_s(\alpha_t)| - 1)$
is the number of triples whose center node is $\alpha_t$. If $t$ is a
random triple whose center is drawn uniformly from $\Omega$ and whose
endpoints are a uniformly random pair of the center's social
neighbors, then each $t \in T_\Omega$ is drawn with probability
$1 / (|\Omega| \tau(\alpha_t))$ and we have
$C_\Omega = 2^{-I} E[F(t)]$. This observation informs the design of
our approximation algorithm, which is shown in Algorithm 2. Our
algorithm computes the average social clustering coefficient when
setting $\Omega = V_s$, and the average attribute clustering
coefficient when setting $\Omega = V_a$. Note that our algorithm can
also be used to compute the average clustering coefficient
distribution with respect to node degrees. The following theorem
bounds the error of our algorithm.

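THEOREM 3. With the number of samples
$K = \lceil \frac{\ln 2\nu}{2\epsilon^2} \rceil$, the approximated
average clustering coefficient $\tilde{C}_\Omega$ output by our
algorithm satisfies $|\tilde{C}_\Omega - C_\Omega| < \epsilon$ with
probability at least $1 - \frac{1}{\nu}$.

PROOF. Assume $t_1, t_2, \ldots, t_K$ are $K$ independent random
triples from the triple set $T_\Omega$, each sampled as in
Algorithm 2. Then we have
$C_\Omega = E[\frac{1}{2^I K} \sum_{i=1}^{K} F(t_i)]$. Since each
term $F(t_i) / 2^I$ lies in $[0, 1]$, according to Hoeffding's
bound [20], we obtain

$\Pr(|\tilde{C}_\Omega - C_\Omega| \ge \epsilon) \le 2e^{-2K\epsilon^2}$.

Thus,

$\Pr(|\tilde{C}_\Omega - C_\Omega| < \epsilon) \ge 1 - 2e^{-2K\epsilon^2}$.

So we get $K = \lceil \frac{\ln 2\nu}{2\epsilon^2} \rceil$ by
requiring $2e^{-2K\epsilon^2} \le \frac{1}{\nu}$.

□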
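
To make the sample-size requirement concrete, the following Python sketch computes the number of samples needed for a target error and confidence. It assumes the Hoeffding form of the bound, in which the failure probability after K samples is at most 2e^{-2K\epsilon^2}, and it assumes that the confidence parameter \nu means a failure probability of at most 1/\nu. The function names are hypothetical and chosen for illustration.

```python
import math


def required_samples(epsilon, nu):
    """Smallest K with 2 * exp(-2 * K * epsilon**2) <= 1 / nu (assumed bound)."""
    if epsilon <= 0 or nu <= 0.5:
        raise ValueError("need epsilon > 0 and nu > 0.5")
    return max(1, math.ceil(math.log(2 * nu) / (2 * epsilon ** 2)))


def failure_bound(num_samples, epsilon):
    """Hoeffding upper bound on the probability that the error is at least epsilon."""
    return 2 * math.exp(-2 * num_samples * epsilon ** 2)
```

Under these assumptions, an error of 0.01 with failure probability at most 1/100 needs `required_samples(0.01, 100)` samples, a number that does not depend on the number of nodes or links in the SAN.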



